Abstract. We introduce synchronizing objectives for Markov decision processes (MDPs). Intuitively,
a synchronizing objective requires that eventually, at every step, there is a state which concentrates
almost all the probability mass. In particular, it implies that the probabilistic system behaves in the
long run like a deterministic system: eventually, the current state of the MDP can be identified
almost with certainty.

We study the problem of deciding the existence of a strategy to enforce a synchronizing objective 
in MDPs. We show that the problem is decidable for general strategies, as well as for blind strategies 
where the player cannot observe the current state of the MDP. We also show that pure strategies are 
sufficient, but memory may be necessary. 



1 Introduction 



A Markov decision process (MDP) is a model for systems that exhibit both probabilistic and
nondeterministic behavior. MDPs have been used to model and solve control problems for stochastic systems
where the nondeterminism represents the freedom of the controller to choose a control action, while the
probabilistic component of the behavior describes the system response to control actions. MDPs have
also been adopted as models for concurrent probabilistic systems, probabilistic systems operating in open
environments [7], and under-specified probabilistic systems [4].

Traditional objectives for MDPs specify a set $S$ of paths, where a path is an infinite sequence of states
through the underlying graph of the MDP. The value of interest is the probability that an execution of the
MDP under a given strategy belongs to $S$. For example, a reachability objective specifies all paths that
visit a given target state $\ell$. A typical qualitative question is to decide whether there exists a strategy such
that a given state $\ell$ is reached with probability 1.

In this paper, we consider a different type of objectives which specify a set of infinite sequences
$X = X_0, X_1, \dots$ of probability distributions over the states [6]. Intuitively, the distribution $X_i$ in the
sequence gives for each state $\ell$ the probability $X_i(\ell)$ to be in state $\ell$ at step $i \geq 0$. We introduce
synchronizing objectives which specify sequences of distributions in which the probability tends to
accumulate in a single state. We use the infinity norm as a measure of the highest peak in a probability
distribution $X_i$ (i.e., $\|X_i\| = \max_{\ell \in L} X_i(\ell)$) and we require that the limit¹ of this measure in the
sequence is 1. Intuitively, this requires that in the long run, the MDP behaves like a deterministic system:
from some point on, at every step $i$ there is a state $\ell_i$ which accumulates almost all the probability.
Note that satisfying such an objective implies that there exists a state $\ell$ which is reached with
probability 1. The converse does not hold because reachability objectives do not require the visits to the
target state to occur after the same number of steps in (almost) all executions of the MDP. We consider
the problem of deciding if a given MDP is synchronizing for some strategy. We consider the general case
where memoryful randomized strategies are allowed, as well as the special case of blind strategies which
are not allowed to observe the current state of the MDP.

*This work has been done in the MoVES project (P6/39) which is part of the IAP-Phase VI Interuniversity Attraction Poles
Programme funded by the Belgian State, Belgian Science Policy.

¹ Since the limit may not exist in general, we actually consider either liminf or limsup.

Defining objectives as a sequence of probability distributions over states rather than a distribution
over sequences of states is a change of standpoint in the traditional approach to MDP verification. To
our knowledge, there are very few works in this setting. We are aware of the work in [6] which studies
MDPs as generators of probability distributions with applications in sensor networks and dynamical
systems, and shows that the resulting objectives are not expressible in known logics such as PCTL* HJ|4). 
In their definition, probability distributions over states are assigned a vector $v \in \{0,1\}^k$ of truth values
for a finite set of predicates $\varphi_1, \dots, \varphi_k$ (which are linear constraints on the probabilities such as
$\varphi(X) \equiv X(\ell) + X(\ell') \leq c$ for a constant threshold $c$, for example). This can be viewed as a coloring
of the probability distributions using a finite number of colors, and then objectives are languages of
infinite words over the finite alphabet of colors. It is shown that reachability of a given color is
undecidable for MDPs if arbitrary linear predicates are allowed [6]. A decidability result is obtained if
only predicates of the form $\sum_{\ell \in T} X(\ell) > 0$ are allowed.
Synchronizing objectives cannot be expressed in the framework of finite colorings as they
require a real-valued measure (namely, the infinity norm) to be assigned to the probability distributions.

In [2], the monadic logic of probabilities is introduced as a predicate logic which can express
properties of sequences of probability distributions. But because it allows comparison of probabilities only
with constants, it cannot express synchronizing objectives, which would require a quantification over
probability thresholds, such as
$\varphi(X) \equiv \forall \varepsilon > 0 \cdot \exists N \cdot \forall i \geq N \cdot \exists \ell \in L : X_i(\ell) \geq 1 - \varepsilon$,
where $X_i$ is the probability distribution in position $i$ in the sequence $X$.

Synchronizing objectives generalize the notion of synchronizing words. In a deterministic finite 
automaton, a word w is synchronizing if reading w from any state of the automaton always leads to the 
same state. It is sufficient to consider finite words, and it is conjectured that if a synchronizing word
exists, then there exists one of length at most $(n-1)^2$ where $n$ is the number of states of the automaton,
known as Černý's conjecture. Several works have studied this conjecture and related problems (see
the survey in [HI). Viewing deterministic automata as a special case of MDP where all transitions have 
only one successor, a synchronizing word can be seen as a blind strategy to ensure a synchronizing 
objective. Note that we do not present a generalization of Černý's conjecture since in our case, strategies
for MDPs are infinite objects. However, synchronizing objectives provide an extension of the design 
framework for the many applications of the theory of synchronizing words, such as control of discrete 
event systems, planning, biocomputing, and robotics [8]. For example, in probabilistic models of DNA 
transcription, one may ask which molecules to introduce in a cell in order to bring it to a single possible 
state EH. 
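
As a concrete illustration of synchronizing words, the following minimal Python sketch checks whether a word synchronizes a deterministic automaton. The helper names and the dictionary encoding of the transition function are assumptions made for this illustration only. The automaton is the classical four-state Černý automaton, for which the word baaabaaab has length $(4-1)^2 = 9$ and is synchronizing.

```python
def run_word(delta, state, word):
    """Return the state reached from `state` after reading `word`."""
    for letter in word:
        state = delta[state][letter]
    return state


def is_synchronizing(delta, word):
    """A word is synchronizing if all start states end in the same state."""
    return len({run_word(delta, q, word) for q in delta}) == 1


# Four-state Cerny automaton (illustrative encoding, not taken from the paper):
# 'a' is a cyclic shift, 'b' sends state 3 to state 0 and fixes the others.
n = 4
cerny = {q: {"a": (q + 1) % n, "b": 0 if q == n - 1 else q} for q in range(n)}

word = "baaabaaab"
print(is_synchronizing(cerny, word), len(word))
print(is_synchronizing(cerny, "ba"))
```

Read as a blind strategy, such a word is a finite sequence of actions chosen without observing the current state.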

We prove that it is decidable to determine if a given MDP is synchronizing for some strategy, either 
blind or general. We use variants of the subset construction in the underlying graph of MDPs to obtain a 
decidable characterization of synchronizing strategies. Our results imply that pure strategies are sufficient 
to satisfy a synchronizing objective, but we provide an example showing that memory may be necessary, 
both with blind and general strategies. 
2 Definitions

A probability distribution over a finite set $S$ is a function $d : S \to [0,1]$ such that $\sum_{s \in S} d(s) = 1$. The
support of $d$ is the set $\mathrm{Supp}(d) = \{s \in S \mid d(s) > 0\}$. $\mathcal{D}(S)$ denotes the set of all probability distributions
on $S$, and $\mathcal{P}(S)$ the power set of $S$.

Markov decision processes. A Markov decision process (MDP) is a tuple $M = (L, \mu_0, \Sigma, \delta)$ where $L$
is a finite set of states, $\mu_0 \in \mathcal{D}(L)$ is an initial probability distribution over states, $\Sigma$ is a finite set of
actions, and $\delta : L \times \Sigma \to \mathcal{D}(L)$ is a probabilistic transition function that assigns to each pair of a state
and an action a probability distribution over successor states. A Markov chain is a special case of an MDP with
only one action ($|\Sigma| = 1$). Markov chains are therefore generally viewed as a tuple $M = (L, \mu_0, \delta)$ where
$\delta : L \to \mathcal{D}(L)$. For an action $\sigma \in \Sigma$ and a state $\ell \in L$, let $\mathrm{Post}_\sigma(\ell) = \mathrm{Supp}(\delta(\ell, \sigma))$, and for a set $s \subseteq L$,
let $\mathrm{Post}_\sigma(s) = \bigcup_{\ell \in s} \mathrm{Post}_\sigma(\ell)$.

Example Figure 1(a) shows an MDP with four states and alphabet $\Sigma = \{\sigma_1, \sigma_2\}$. The initial probability
distribution is $\mu_0(1) = 1$ and $\mu_0(i) = 0$ for $i \in \{2,3,4\}$, and the probabilistic transition function $\delta$ in
state 1 is such that $\delta(1,\sigma_1)(2) = \delta(1,\sigma_1)(3) = 1/2$ and $\delta(1,\sigma_2)(1) = 1$.
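
The definitions of $\delta$ and $\mathrm{Post}_\sigma$ can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the dictionary encoding and the function names are ours, and the three-state MDP is hypothetical. Only the transitions of state 1 follow the example above; the transitions of states 2 and 3 are invented so that the block is self-contained.

```python
def post(delta, state, action):
    """Post_sigma(l): the support of the distribution delta(l, sigma)."""
    return {succ for succ, p in delta[(state, action)].items() if p > 0}


def post_set(delta, cell, action):
    """Post_sigma(s): union of Post_sigma(l) over all states l of the set s."""
    return set().union(*(post(delta, state, action) for state in cell))


# Hypothetical MDP: state 1 as in the example, states 2 and 3 are assumptions.
delta = {
    (1, "s1"): {2: 0.5, 3: 0.5},
    (1, "s2"): {1: 1.0},
    (2, "s1"): {3: 1.0},
    (2, "s2"): {2: 1.0},
    (3, "s1"): {3: 1.0},
    (3, "s2"): {3: 1.0},
}

# Every delta(l, sigma) must be a probability distribution.
assert all(abs(sum(d.values()) - 1.0) < 1e-12 for d in delta.values())

print(post(delta, 1, "s1"))
print(post_set(delta, {1, 2}, "s2"))
```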

We describe the behavior of an MDP as a one-player stochastic game played for infinitely many
rounds. In the first round, the game starts in state $\ell$ with probability $\mu_0(\ell)$. In each round, if the game
is in the state $\ell$ and the player chooses the action $\sigma \in \Sigma$, then the game moves to the successor state $\ell'$
chosen with probability $\delta(\ell,\sigma)(\ell')$, and the next round starts. We consider two versions of this game.
In both versions, the player knows the structure of the MDP. In the first version the player has perfect
information and can see the current state of the game; in the second version the player is blind: the player
is not allowed to observe the current state of the game, and only knows the number of rounds that have been
played so far.

A play of the game is an infinite sequence of interleaved states and actions $\pi = \ell_0 \sigma_0 \ell_1 \cdots$ such that
$\ell_{i+1} \in \mathrm{Post}_{\sigma_i}(\ell_i)$ for all $i \geq 0$. The set of all plays over $M$ is denoted by $\mathrm{Plays}(M)$. A finite prefix
$h = \ell_0 \sigma_0 \ell_1 \cdots \sigma_{n-1} \ell_n$ of a play $\pi$ is called a history, the last state of $h$ is $\mathrm{Last}(h) = \ell_n$, the $i$-th action
and state of $h$ are $\mathrm{Action}(h,i) = \sigma_i$ and $\mathrm{State}(h,i) = \ell_i$, and its length is $|h| = n$. The set of all histories
of plays is denoted by $\mathrm{Hists}(M)$.

Strategies and outcome. In the game, the choice of the action is made by the player according to a
strategy. Depending on what the player can observe and record, various classes of strategies can be used.
A randomized strategy (or simply a strategy) over an MDP $M$ is a function $\alpha : \mathrm{Hists}(M) \to \mathcal{D}(\Sigma)$.
A pure (deterministic) strategy is a special case of randomized strategy where for all $h \in \mathrm{Hists}(M)$,
there exists an action $\sigma \in \Sigma$ such that $\alpha(h)(\sigma) = 1$. A memoryless strategy is a randomized strategy
$\alpha$ such that $\alpha(h_1) = \alpha(h_2)$ for all $h_1, h_2 \in \mathrm{Hists}(M)$ with $\mathrm{Last}(h_1) = \mathrm{Last}(h_2)$. In this last case, the
player cannot record the history of the play and makes a choice according to the current state only.
For convenience, we view pure strategies as functions $\alpha : \mathrm{Hists}(M) \to \Sigma$, and memoryless strategies as
functions $\alpha : L \to \mathcal{D}(\Sigma)$. Hence, a pure memoryless strategy is a function $\alpha : L \to \Sigma$.

A strategy $\alpha$ is blind if $\alpha(h_1) = \alpha(h_2)$ for all $h_1, h_2 \in \mathrm{Hists}(M)$ such that $|h_1| = |h_2|$. Blind strategies
can be viewed as functions $\alpha : \mathbb{N} \to \mathcal{D}(\Sigma)$ (or $\alpha : \mathbb{N} \to \Sigma$ for pure blind strategies) which assign in each
round a probability distribution over actions. Sometimes we talk about perfect-information strategies to
emphasize that we consider strategies that are not necessarily blind.
The outcome of the game played on an MDP $M = (L, \mu_0, \Sigma, \delta)$ using a strategy $\alpha$ is the infinite
sequence $X_0^\alpha X_1^\alpha \dots$ of probability distributions over the set of states $L$, where $X_0^\alpha = \mu_0$ and for all $n > 0$,

$$X_n^\alpha(\ell) = \sum_{h \in \mathrm{Hists}(M) \,:\, \mathrm{Last}(h) = \ell,\ |h| = n} \Pr^\alpha(h)$$

where the probability $\Pr^\alpha(h)$ of a history $h = \ell_0 \sigma_0 \ell_1 \cdots \sigma_{n-1} \ell_n$ under strategy $\alpha$ is

$$\Pr^\alpha(h) = \mu_0(\ell_0) \cdot \prod_{j=1}^{n} \alpha(\ell_0 \sigma_0 \cdots \ell_{j-1})(\sigma_{j-1}) \cdot \delta(\ell_{j-1}, \sigma_{j-1})(\ell_j).$$

Synchronizing objectives. The norm of a probability distribution $X$ over $L$ is $\|X\| = \max_{\ell \in L} X(\ell)$. We
say that the MDP $M$ with strategy $\alpha$ is strongly synchronizing if

$$\liminf_{n \to \infty} \|X_n^\alpha\| = 1, \qquad (1)$$

and that it is weakly synchronizing if

$$\limsup_{n \to \infty} \|X_n^\alpha\| = 1. \qquad (2)$$

Intuitively, an MDP is synchronizing if the probability mass tends to concentrate in a single state,
either at every step from some point on (for strongly synchronizing), or at infinitely many steps (for
weakly synchronizing). Note that equivalently, $M$ with strategy $\alpha$ is strongly synchronizing if the limit
$\lim_{n \to \infty} \|X_n^\alpha\|$ exists and equals 1. In this paper, we are interested in the problem of deciding if a given
MDP is synchronizing for some strategy. We consider the problem for both perfect-information and
blind strategies.
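
The outcome sequence and its norm can be computed step by step for simple strategies. The sketch below is an illustration under stated assumptions: it only handles pure strategies whose choice depends on the round number and the current state (a blind strategy ignores the state), not arbitrary history-dependent strategies, and the three-state MDP and all names are hypothetical. For such strategies, $X_{n+1}(\ell') = \sum_{\ell} X_n(\ell) \cdot \delta(\ell, \alpha_n(\ell))(\ell')$.

```python
def step(delta, dist, choose, n):
    """Compute X_{n+1} from X_n when action choose(n, l) is played in state l."""
    nxt = {}
    for state, mass in dist.items():
        for succ, p in delta[(state, choose(n, state))].items():
            nxt[succ] = nxt.get(succ, 0.0) + mass * p
    return nxt


def norms(delta, mu0, choose, horizon):
    """Return the list of norms ||X_0||, ..., ||X_{horizon-1}||."""
    dist, out = dict(mu0), []
    for n in range(horizon):
        out.append(max(dist.values()))
        dist = step(delta, dist, choose, n)
    return out


# Hypothetical MDP (assumption for illustration only).
delta = {
    (1, "s1"): {2: 0.5, 3: 0.5}, (1, "s2"): {1: 1.0},
    (2, "s1"): {3: 1.0},         (2, "s2"): {2: 1.0},
    (3, "s1"): {3: 1.0},         (3, "s2"): {3: 1.0},
}


def always_s1(n, state):
    """A pure blind strategy: play s1 in every round."""
    return "s1"


print(norms(delta, {1: 1.0}, always_s1, 4))
```

On this toy MDP the mass splits after the first round and gathers again in state 3 from the second round on, so the sequence of norms is eventually constant and equal to 1.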

Recurrent and transient states. A state $\ell' \in L$ is accessible from a state $\ell \in L$ (denoted $\ell \to \ell'$) if there
is a history $h = \ell_0 \sigma_0 \ell_1 \cdots \sigma_{n-1} \ell_n$ with $\ell_0 = \ell$ and $\ell_n = \ell'$. If both $\ell \to \ell'$ and $\ell' \to \ell$ hold, then we say that
$\ell$ and $\ell'$ are strongly connected (denoted $\ell \leftrightarrow \ell'$). This induces an equivalence relation called the accessibility
relation. An MDP is strongly connected if all pairs of states $\ell, \ell' \in L$ are strongly connected. A state
accessible from a state of $\mathrm{Supp}(\mu_0)$ is simply called an accessible state.

For a Markov chain $M$, the state $\ell$ is recurrent if all states accessible from $\ell$ can access $\ell$ (i.e., $\ell$ and
$\ell'$ are strongly connected for all $\ell'$ such that $\ell \to \ell'$), and the state $\ell$ is transient if there exists some state
$\ell'$ such that $\ell'$ is accessible from $\ell$, but $\ell$ is not accessible from $\ell'$. The next proposition follows from
standard results [5].

Proposition 1 Given a Markov chain $M$, let $X_0, X_1, \dots$ be the sequence of probability distributions of $M$.
Then $\limsup_{n \to \infty} X_n(\ell) = 0$ for all transient states $\ell \in L$, and $\limsup_{n \to \infty} X_n(\ell) > 0$ for all recurrent states
$\ell \in L$.
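
The classification into recurrent and transient states depends only on the graph of the Markov chain, so it can be computed by reachability. The following Python sketch is a minimal illustration; the encoding of the chain as a dictionary and the function names are assumptions, and the three-state chain is hypothetical.

```python
def reachable(chain, start):
    """States accessible from `start` (including `start` itself)."""
    seen, stack = {start}, [start]
    while stack:
        state = stack.pop()
        for succ, p in chain[state].items():
            if p > 0 and succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen


def recurrent_states(chain):
    """A state is recurrent if every state accessible from it can access it back."""
    reach = {s: reachable(chain, s) for s in chain}
    return {s for s in chain if all(s in reach[t] for t in reach[s])}


# Hypothetical Markov chain: states 1 and 2 eventually leak into state 3.
chain = {1: {2: 0.5, 3: 0.5}, 2: {3: 1.0}, 3: {3: 1.0}}

recurrent = recurrent_states(chain)
transient = set(chain) - recurrent
print(recurrent, transient)
```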

Subset constructions. We define two important constructions based on the subset construction idea.
Subset construction is a standard technique to compute, from a nondeterministic finite automaton $N$, an
equivalent deterministic automaton $D$ (for language equivalence), where one state of $D$ corresponds to
the set of possible states (called a cell) in which $N$ can be. We define two kinds of subset constructions
on MDPs, the perfect-information subset construction, and the blind subset construction. As usual, each
state of the subset constructions is a subset of states of the MDP (i.e., a cell). In our case, the main
difference lies in the alphabet. In the perfect-information subset construction, the selection of the next
action depends on the current state (each state of a cell can independently choose an action), while in the
blind subset construction the next action is independent of the state (all states of a cell have to choose
the same action). Thus, an action in the perfect-information subset construction is a function $\hat{\sigma} : L \to \Sigma$
which assigns to each state $\ell \in L$ its choice among the actions in $\Sigma$.
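Definition 1 (Perfect-information subset construction of an MDP) For an MDP $M = (L, \mu_0, \Sigma, \delta)$,
the perfect-information subset construction is an automaton $M_P = (\mathcal{L}, L_I, \hat{\Sigma}, \delta_P)$ where
$\mathcal{L} = \mathcal{P}(L) \setminus \{\emptyset\}$, $L_I = \mathrm{Supp}(\mu_0)$, $\hat{\Sigma} = \{\hat{\sigma} \mid \hat{\sigma} : L \to \Sigma\}$ is the alphabet, and
$\delta_P : \mathcal{L} \times \hat{\Sigma} \to \mathcal{L}$ where for all $s_1, s_2 \in \mathcal{L}$ and $\hat{\sigma} \in \hat{\Sigma}$, we have $\delta_P(s_1, \hat{\sigma}) = s_2$ where
$s_2 = \bigcup_{\ell \in s_1} \mathrm{Post}_{\hat{\sigma}(\ell)}(\ell)$.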
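
The transition function $\delta_P$ of Definition 1 can be sketched as follows. This is an illustrative sketch only: the function name `delta_p`, the encoding of an action function $\hat{\sigma}$ as a dictionary from states to actions, and the three-state MDP are assumptions, not part of the construction in the paper.

```python
def delta_p(delta, cell, choice):
    """Successor cell: union over l in `cell` of Post_{choice[l]}(l)."""
    succ = set()
    for state in cell:
        dist = delta[(state, choice[state])]
        succ |= {t for t, p in dist.items() if p > 0}
    return frozenset(succ)


# Hypothetical MDP (assumption for illustration only).
delta = {
    (1, "s1"): {2: 0.5, 3: 0.5}, (1, "s2"): {1: 1.0},
    (2, "s1"): {3: 1.0},         (2, "s2"): {2: 1.0},
    (3, "s1"): {3: 1.0},         (3, "s2"): {3: 1.0},
}

# Two action functions that differ on states 1 and 2.
sigma_hat_a = {1: "s1", 2: "s2", 3: "s1"}
sigma_hat_b = {1: "s2", 2: "s1", 3: "s1"}

print(sorted(delta_p(delta, {1, 2}, sigma_hat_a)))
print(sorted(delta_p(delta, {1, 2}, sigma_hat_b)))
```

Each state of the cell picks its own action, which is the point where the perfect-information construction departs from the classical subset construction.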

Example Figure 1(b) shows the perfect-information subset construction $M_P$ of the MDP drawn in
Figure 1(a) (presented in the first example). Let us present $\hat{\Sigma}$ in the table below. Each row is labelled by a
function $\hat{\sigma}_i$ ($i \in \{1, \dots, 11\}$), each column is labelled by a state $\ell$, and each entry shows the value of
$\hat{\sigma}_i(\ell)$.

[Table: values of $\hat{\sigma}_1, \dots, \hat{\sigma}_{11}$ for each state; each entry is $\sigma_1$, $\sigma_2$, or $\{\sigma_1, \sigma_2\}$.]


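Note that a function $\hat{\sigma}$ with $\hat{\sigma}(\ell) = \{\sigma_1, \sigma_2\}$ (for a state $\ell$) gives two different functions $\hat{\sigma}_i$ and $\hat{\sigma}_j$ where
$\hat{\sigma}_i(\ell) = \sigma_1$ and $\hat{\sigma}_j(\ell) = \sigma_2$; but these two functions behave similarly.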

A cycle of $M_P$ is a finite sequence $C_P = s_0 \hat{\sigma}_0 s_1 \dots s_{d-1} \hat{\sigma}_{d-1} s_d$ of interleaved cells and symbols such
that $\delta_P(s_j, \hat{\sigma}_j) = s_{j+1}$ for all $0 \leq j < d$, and $s_0 = s_d$. Note that, in this definition, $d$ is the length of the
cycle $C_P$. We write $s \in C_P$ if $s$ is one of the cells $s_j$ ($0 \leq j \leq d$) of the finite sequence of the cycle $C_P$.
A simple cycle is a cycle where all cells $s_0, \dots, s_{d-1}$ are different. We are interested in defining some
property on cycles of the perfect-information subset construction for a given MDP.

Definition 2 (Recurrent cyclic sets) Let $C_P = s_0 \hat{\sigma}_0 \dots s_{d-1} \hat{\sigma}_{d-1} s_d$ be a cycle of the perfect-information
subset construction $M_P$ for a given MDP $M$. A recurrent cyclic set for the cycle $C_P$ is a sequence
$G = g_0 g_1 \dots g_d$ such that $g_0 = g_d$, and $\emptyset \neq g_i \subseteq s_i$ and $\bigcup_{\ell \in g_i} \mathrm{Post}_{\hat{\sigma}_i(\ell)}(\ell) = g_{i+1}$ for all $0 \leq i < d$.



Figure 2: (a) shows an MDP, and (b) shows some part of its perfect information subset construction. 

A cycle $C_P$ might have several recurrent cyclic sets. A recurrent cyclic set $G$ for a given cycle $C_P$ is
said to be minimal if there is no other recurrent cyclic set $G'$ ($G \neq G'$) such that for $0 \leq i < d$, and for
$g_i \in G$, $g'_i \in G'$, we have $g'_i \subseteq g_i$. We denote the set of all minimal recurrent cyclic sets of the cycle $C_P$
by $A(C_P) = \{G \mid G \text{ is a minimal recurrent cyclic set for the cycle } C_P\}$.
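
Since $g_{i+1}$ is determined by $g_i$ and $\hat{\sigma}_i$, a recurrent cyclic set is fixed by its first element $g_0$, so all of them can be enumerated by trying every non-empty subset of $s_0$. The Python sketch below follows this observation; it is an illustration under stated assumptions (the function names, the encoding, and the four-state single-action MDP are hypothetical, and the caller is assumed to pass a valid cycle of $M_P$).

```python
from itertools import combinations


def successors(delta, g, choice):
    """Union over l in g of Post_{choice[l]}(l)."""
    return frozenset(t for l in g for t, p in delta[(l, choice[l])].items() if p > 0)


def recurrent_cyclic_sets(delta, s0, choices):
    """All sequences g_0 ... g_d with g_0 = g_d, one per suitable subset g_0 of s0."""
    found, states = [], sorted(s0)
    for k in range(1, len(states) + 1):
        for g0 in combinations(states, k):
            seq = [frozenset(g0)]
            for choice in choices:
                seq.append(successors(delta, seq[-1], choice))
            if seq[-1] == seq[0]:
                found.append(tuple(seq))
    return found


def minimal_sets(sets):
    """Keep the sets with no other set below them (componentwise inclusion)."""
    def below(h, g):
        return h != g and all(a <= b for a, b in zip(h, g))
    return [g for g in sets if not any(below(h, g) for h in sets)]


# Hypothetical MDP with one action: two disjoint deterministic 2-cycles.
delta = {(1, "a"): {2: 1.0}, (2, "a"): {1: 1.0},
         (3, "a"): {4: 1.0}, (4, "a"): {3: 1.0}}
all_a = {l: "a" for l in (1, 2, 3, 4)}

# Cycle {1,3} -> {2,4} -> {1,3} of the subset construction.
cycle_sets = recurrent_cyclic_sets(delta, {1, 3}, [all_a, all_a])
minimal = minimal_sets(cycle_sets)
print(len(cycle_sets), len(minimal))
```

In this toy cycle there are three recurrent cyclic sets, two of which are minimal and incomparable.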

Example Consider the MDP $M$ in Figure 2 (the initial distribution is $\mu_0(1) = 1$ and $\mu_0(i) = 0$ for
$i \in \{2, \dots, 9\}$). Figure 2(b) shows one cycle of the perfect-information subset construction $M_P$. Let us
present $\hat{\Sigma}$ in the table below. Each row is labeled by a function $\hat{\sigma}_i$ ($i \in \{1, \dots, 4\}$), each column is labeled by
a state $\ell$, and each entry shows the value of $\hat{\sigma}_i(\ell)$.

[Table: values of $\hat{\sigma}_1, \dots, \hat{\sigma}_4$ for each state; each entry is $\sigma_1$, $\sigma_2$, or $\{\sigma_1, \sigma_2\}$.]




For the cycle $C_P = \{2,5,8\}\, \hat{\sigma}_2\, \{3,5,6\}\, \hat{\sigma}_3\, \{4,7,9\}\, \hat{\sigma}_4\, \{2,5,8\}$, the set of minimal recurrent cyclic
sets is $A(C_P) = \{\{\{2\},\{3\},\{4\}\}, \{\{5\},\{6\},\{7\}\}\}$. The elements of $A(C_P)$ are not comparable.

The blind subset construction for an MDP is a special case of its perfect-information subset construction
where the action functions $\hat{\sigma} \in \hat{\Sigma}$ are restricted to constant functions. In each cell, all states have to
choose the same action.



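Definition 3 (Blind subset construction of an MDP) The blind subset construction for a given MDP
$M = (L, \mu_0, \Sigma, \delta)$ is an automaton $M_B = (\mathcal{L}, L_I, \Sigma, \delta_B)$ where $\mathcal{L} = \mathcal{P}(L) \setminus \{\emptyset\}$, $L_I = \mathrm{Supp}(\mu_0)$, and for
all $s_1, s_2 \in \mathcal{L}$ and $\sigma \in \Sigma$, we have $\delta_B(s_1, \sigma) = s_2$ where $s_2 = \mathrm{Post}_\sigma(s_1)$.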

We denote cycles in the blind subset construction by $C_B$.
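
The accessible part of the blind subset construction can be explored by a breadth-first search from the initial cell $\mathrm{Supp}(\mu_0)$. The sketch below is a minimal illustration; the function name `blind_cells`, the returned pair of reachable cells and edges, and the three-state MDP are assumptions made for this example.

```python
from collections import deque


def blind_cells(delta, mu0, actions):
    """Reachable cells and edges of the blind subset construction."""
    start = frozenset(l for l, p in mu0.items() if p > 0)
    seen, queue, edges = {start}, deque([start]), {}
    while queue:
        cell = queue.popleft()
        for a in actions:
            # All states of the cell play the same action a.
            nxt = frozenset(t for l in cell
                            for t, p in delta[(l, a)].items() if p > 0)
            edges[(cell, a)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen, edges


# Hypothetical MDP (assumption for illustration only).
delta = {
    (1, "s1"): {2: 0.5, 3: 0.5}, (1, "s2"): {1: 1.0},
    (2, "s1"): {3: 1.0},         (2, "s2"): {2: 1.0},
    (3, "s1"): {3: 1.0},         (3, "s2"): {3: 1.0},
}

cells, edges = blind_cells(delta, {1: 1.0}, ["s1", "s2"])
print(sorted(sorted(c) for c in cells))
```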

3 Synchronizing Objectives for Perfect-Information Strategies 

We have defined a perfect-information one-player stochastic game in which the player can see the current
state of the game and record the sequence of visited states. We show that synchronizing strategies can be
characterized in the perfect-information subset construction, giving a decidability result. We also show
in the next example that memory may be necessary.

Figure 3: An MDP where memory is necessary to win the strongly synchronizing objective.

Example Consider the MDP $M$ in Figure 3 (the initial distribution is $\mu_0(1) = 1$ and $\mu_0(i) = 0$ for
$i \in \{2, \dots, 5\}$), and let $\alpha$ be the strategy defined as follows: $\alpha((L \times \Sigma)^* \ell)(\sigma) = 1/2$ for all $\sigma \in \Sigma$ and
$\ell \in \{1,3,4,5\}$, and for the histories ending in the state 2,

$$\alpha((L \times \Sigma)^* \ell\, \Sigma\, 2)(\sigma) = \begin{cases} 1 & \text{if } \ell = 1 \text{ and } \sigma = \sigma_2, \\ 1 & \text{if } \ell \neq 1 \text{ and } \sigma = \sigma_1, \\ 0 & \text{otherwise.} \end{cases}$$

In this example, it is easy to check that the strategy $\alpha$ is strongly synchronizing. In state 2, it plays
$\sigma_1$ and $\sigma_2$ in alternation in order to ensure synchronization with the cycle 3, 4, 5 of length 3. However,
none of the memoryless strategies is strongly synchronizing, showing that memory is necessary. This
example also shows that memory is necessary for the weakly synchronizing objective, as well as for blind
strategies.

Proposition 2 For both strongly and weakly synchronizing objectives, memoryless strategies are not 
sufficient in MDPs. 

Theorem 1 For a perfect-information game over an MDP $M$, there exists a strategy $\alpha$ such that $M$ with
strategy $\alpha$ is strongly synchronizing if and only if the perfect-information subset construction $M_P$ for $M$
has an accessible cycle $C_P$ such that $|A(C_P)| = 1$, and for $G \in A(C_P)$ and for all $g \in G$, $|g| = 1$.

Proof Sufficient condition. We suppose that the perfect-information subset construction $M_P$ for $M$ has
an accessible cycle $C_P = s_0 \hat{\sigma}_0 \dots s_d$ such that $|A(C_P)| = 1$, and for $G \in A(C_P)$ and for all $g \in G$, we
have $|g| = 1$. Since this cycle is accessible, there exists a finite path $P = p_0 \hat{\sigma}'_0 p_1 \dots p_{m-1} \hat{\sigma}'_{m-1} p_m$ in $M_P$
from $p_0 = L_I$ to $p_m = s_0 = s_d$ (see Figure 4). Consider the pure strategy $\alpha$ defined as follows:

$$\alpha((L \times \Sigma)^k \ell) = \begin{cases} \hat{\sigma}'_k(\ell) & \text{if } 0 \leq k < m, \\ \hat{\sigma}_{(k-m) \bmod d}(\ell) & \text{if } m \leq k. \end{cases}$$
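
The strategy above only needs the round number and the current state: it follows the action functions of the path for the first $m$ rounds and then repeats those of the cycle forever. A minimal Python sketch of this lasso-shaped strategy is given below; the name `lasso_strategy` and the sample action functions are hypothetical and serve only to illustrate the indexing.

```python
def lasso_strategy(prefix, cycle):
    """Build alpha(k, l) from the action functions of a path and of a cycle."""
    m, d = len(prefix), len(cycle)

    def alpha(k, state):
        if k < m:
            return prefix[k][state]          # sigma-hat'_k along the path
        return cycle[(k - m) % d][state]     # sigma-hat_{(k-m) mod d} on the cycle

    return alpha


# Hypothetical action functions over states {1, 2} (illustration only).
prefix = [{1: "s1", 2: "s1"}]
cycle = [{1: "s1", 2: "s2"}, {1: "s2", 2: "s1"}]

alpha = lasso_strategy(prefix, cycle)
print([alpha(k, 2) for k in range(5)])
```

The strategy is pure and uses memory only to count rounds modulo the cycle length after the initial path.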



Figure 4: An accessible cycle $C_P$ of $M_P$ which is reachable by a finite path $p_0, \dots, p_m$.

Let us construct a finite Markov chain $M'$ whose long-term behavior simulates the long-term
behavior of the MDP $M$ under the strategy $\alpha$ for synchronizing objectives. This Markov chain is
$M' = (L', \mu'_0, \delta')$ where $L' = \{(i,\ell) \mid 0 \leq i < m+d \text{ and } \ell \in L\}$, the initial distribution $\mu'_0$ is defined as
follows

$$\mu'_0((i,\ell)) = \begin{cases} \mu_0(\ell) & \text{if } i = 0, \\ 0 & \text{otherwise,} \end{cases}$$

and the probability transition function $\delta'$ is defined as follows

$$\delta'((i,\ell))((i',\ell')) = \begin{cases} \delta(\ell, \hat{\sigma}'_i(\ell))(\ell') & \text{if } 0 \leq i < m,\ i' = i+1,\ \ell \in p_i \text{ and } \ell' \in p_{i'}, \\ \delta(\ell, \hat{\sigma}_{i-m}(\ell))(\ell') & \text{if } m \leq i < m+d,\ i' = m + ((i-m+1) \bmod d),\ \ell \in s_{i-m} \text{ and } \ell' \in s_{(i-m+1) \bmod d}, \\ 0 & \text{otherwise.} \end{cases}$$

The idea is that each cell $p_i$ ($0 \leq i < m$) of the path $P$ and, similarly, each cell $s_{i-m}$ ($m \leq i < m+d$) of
the cycle $C_P$ corresponds to $|L|$ states in the Markov chain $M'$ (one for each state of the MDP $M$). The
value of $\delta'((i,\ell))((i',\ell'))$ is the probability to reach in one step the state $(i',\ell')$ from the state $(i,\ell)$;
semantically it gives the probability to go from $\ell$ to $\ell'$ at step $i$. We show that (a) if the Markov chain
$M'$ is strongly synchronizing, then so is the MDP $M$ under the strategy $\alpha$, and that (b) $M'$ is strongly
synchronizing.

Proving (a) is straightforward from the definition of the Markov chain $M'$. Each state of the MDP $M$
corresponds to $m+d$ states of $M'$. Then if, from some point on, the mass of probability accumulates in one
state of $M'$ and afterward moves totally to another one, the same happens in $M$. In detail, let the sequence
$X_i^\alpha$ ($i \in \mathbb{N}$) denote the outcome of the MDP $M$ under the strategy $\alpha$, and let $X'_i$ ($i \in \mathbb{N}$) denote the probability
distribution at step $i$ generated by the Markov chain $M'$. Note that $X_i^\alpha$ is a vector with $|L|$
entries, while $X'_i$ has $|L| \cdot (m+d)$ entries, at most $|L|$ of which are non-zero. Let us compute and
compare the non-zero entries of these two sequences. For $\ell \in L$:

$$X_0^\alpha(\ell) = \mu_0(\ell) = X'_0((0,\ell)), \quad \text{and } X'_0((j,\ell)) = 0 \text{ for all } j \neq 0.$$

$$X_1^\alpha(\ell) = \sum_{\ell' \in L} \mu_0(\ell') \cdot \delta(\ell', \alpha(\ell'))(\ell) = \sum_{\ell' \in L} \mu_0(\ell') \cdot \delta(\ell', \hat{\sigma}'_0(\ell'))(\ell) = \sum_{\ell' \in L} \mu'_0((0,\ell')) \cdot \delta'((0,\ell'))((1,\ell)) = X'_1((1,\ell)),$$
$$\text{and } X'_1((j,\ell)) = 0 \text{ for all } j \neq 1.$$

For $1 < i \leq m$, by definition of $\alpha$:

$$X_i^\alpha(\ell) = \sum_{\ell_0, \dots, \ell_{i-1} \in L} \mu_0(\ell_0) \cdot \delta(\ell_0, \hat{\sigma}'_0(\ell_0))(\ell_1) \cdots \delta(\ell_{i-1}, \hat{\sigma}'_{i-1}(\ell_{i-1}))(\ell)$$
$$= \sum_{\ell_0, \dots, \ell_{i-1} \in L} \mu'_0((0,\ell_0)) \cdot \delta'((0,\ell_0))((1,\ell_1)) \cdots \delta'((i-1,\ell_{i-1}))((i,\ell)) = X'_i((i,\ell)).$$

We also have $X'_i((j,\ell)) = 0$ for all $j \neq i$; these results give $\|X_i^\alpha\| = \|X'_i\|$ for $i \leq m$. Finally,
consider $i > m$:

$$X_i^\alpha(\ell) = \sum_{\ell_0, \dots, \ell_{i-1} \in L} \mu_0(\ell_0) \cdot \delta(\ell_0, \hat{\sigma}'_0(\ell_0))(\ell_1) \cdot \delta(\ell_1, \hat{\sigma}'_1(\ell_1))(\ell_2) \cdots \delta(\ell_{i-1}, \hat{\sigma}_{(i-1-m) \bmod d}(\ell_{i-1}))(\ell)$$
$$= \sum_{\ell_0, \dots, \ell_{i-1} \in L} \mu'_0((0,\ell_0)) \cdot \delta'((0,\ell_0))((1,\ell_1)) \cdots \delta'((m + (i-1-m) \bmod d, \ell_{i-1}))((m + (i-m) \bmod d, \ell))$$
$$= X'_i((m + (i-m) \bmod d, \ell)).$$

We also have $X'_i((j,\ell)) = 0$ for all $j \neq m + (i-m) \bmod d$; these results give $\|X_i^\alpha\| = \|X'_i\|$ for $i > m$. We have shown
that $X_i^\alpha(\ell) = X'_i((j,\ell))$ where for $0 \leq i < m$ we have $j = i$, and for $i \geq m$ we have $j = m + (i-m) \bmod d$.
This gives $\|X_i^\alpha\| = \|X'_i\|$ for all $i \in \mathbb{N}$, meaning that if the Markov chain $M'$ is synchronizing, so is
the MDP $M$ under the strategy $\alpha$.

To show (b), we study transient and recurrent states of the Markov chain $M'$. Suppose that $G \in A(C_P)$
is the only recurrent cyclic set of the cycle, and it includes $d$ elements $g_0, \dots, g_{d-1}$. Let $R$ be the set of
states $(m+i, \ell)$ such that $\ell \in g_i$, for $0 \leq i < d$. We claim that the states of $R$ are the only recurrent states
in the Markov chain $M'$.

• First, we can see that the states of $R$ are recurrent. By construction, the states of $R$ are strongly
connected. In addition, we have to prove that if $(m+i,\ell) \in R$ and $(m+i,\ell) \to (m+j,\ell')$, then
$(m+j,\ell') \in R$. This holds by induction on the equality $\bigcup_{\ell \in g_i} \mathrm{Post}_{\hat{\sigma}_i(\ell)}(\ell) = g_{i+1}$. Note that
$(m+i,\ell) \in R$ implies that $\ell \in g_i$, and if $(m+i,\ell) \to (m+j,\ell')$ then $\ell'$ has to lie in $g_j$.

• Now, we show that the states of $R$ are the only recurrent states. By contradiction, suppose that
there is another set $R'$ of recurrent states in the Markov chain $M'$. By Proposition 1, and since the
states $(i,\ell)$ ($0 \leq i < m$) are visited only once, they cannot be recurrent; therefore we consider
only the states $(m+i,\ell)$ with $0 \leq i < d$ of the Markov chain $M'$. Let $g'_i$ denote the set
$\{\ell \mid (m+i,\ell) \in R'\} \cap s_i$ for $0 \leq i < d$. The construction of the Markov chain implies that a state
$(m+i,\ell)$ can only have outgoing edges toward some states $(m + (i+1) \bmod d, \ell')$; hence $g'_i \neq \emptyset$ for
all $0 \leq i < d$. On the other hand, the definition of recurrent states requires that each state accessible
from $(m+i,\ell) \in R'$ can access $(m+i,\ell)$, therefore $\bigcup_{\ell \in g'_i} \mathrm{Post}_{\hat{\sigma}_i(\ell)}(\ell) = g'_{(i+1) \bmod d}$. This
contradicts $|A(C_P)| = 1$.

Based on Proposition 1, for the transient states $(k,\ell)$, the probability $X'_n((k,\ell))$ vanishes as $n \to \infty$.
Since for all $g \in G$ we have $|g| = 1$, the support of $X'_n$ ($n \geq m$) contains only one recurrent state. Thus,
the probability mass accumulates in that state: for all $\varepsilon > 0$, there exists $n_0$ such that for all $n \geq n_0$ there is
a state $(i,\ell)$ with $X'_n((i,\ell)) > 1 - \varepsilon$, that is $\|X'_n\| > 1 - \varepsilon$. Hence, $\lim_{n \to \infty} \|X'_n\| = 1$ and $M'$ is strongly synchronizing.
Therefore, so is the MDP $M$ under the strategy $\alpha$.

Necessary condition. Assume that the MDP $M$ with strategy $\alpha$ is strongly synchronizing. Then
$\forall \varepsilon > 0 \cdot \exists n_0 \in \mathbb{N} \cdot \forall n \geq n_0 \cdot \exists q_n$ such that $X_n^\alpha(q_n) > 1 - \varepsilon$. Moreover the state $q_n$ is unique, and we show below
that it is independent of $\varepsilon$ (assuming $\varepsilon < \frac{1}{2}$).

Let $\nu$ be the smallest positive probability among all probability distributions of the MDP $M$ (i.e.,
$\nu = \min_{\ell \in L,\, \sigma \in \Sigma,\, \ell' \in \mathrm{Supp}(\delta(\ell,\sigma))} \delta(\ell,\sigma)(\ell')$). Let $\varepsilon < \frac{\nu}{1+\nu}$. We claim that for all $n \geq n_0$, there exists some
action $\sigma \in \Sigma$ such that $\mathrm{Post}_\sigma(q_n) = \{q_{n+1}\}$ is a singleton. Toward contradiction, assume that for all
$\sigma \in \Sigma$, the statement $\mathrm{Post}_\sigma(q_n) \neq \{q_{n+1}\}$ is satisfied. The probability mass which does not move to $q_{n+1}$
(from $q_n$) is at least $\nu \cdot (1 - \varepsilon)$. And since $M$ is strongly synchronizing, we have:

$$1 - \varepsilon < \|X_{n+1}^\alpha\| \leq 1 - \nu \cdot (1 - \varepsilon).$$

This gives $\nu \cdot (1 - \varepsilon) < \varepsilon$, that is $\varepsilon > \frac{\nu}{1+\nu}$, which is a contradiction. Therefore, for all $n \geq n_0$, there exists $\sigma \in \Sigma$ such that
$\mathrm{Post}_\sigma(q_n) = \{q_{n+1}\}$. This implies that the infinite sequence of states $J = q_{n_0} q_{n_0+1} \dots$ is uniquely defined.

The sequence $J$ is used to define a pure synchronizing strategy $\beta$ from the randomized synchronizing
strategy $\alpha$. This construction implies that pure strategies are sufficient for strongly synchronizing
objectives. We define the pure strategy $\beta$ as follows:

• for $h \in \mathrm{Hists}(M)$ with $|h| = i$ and $\mathrm{Last}(h) = q_i$, we define $\beta(h) = \sigma$ where $\mathrm{Post}_\sigma(\mathrm{Last}(h)) = \{q_{i+1}\}$,

• for $h \in \mathrm{Hists}(M)$ with $|h| = i$ and $\mathrm{Last}(h) \neq q_i$, we define $\beta(h) = \mathrm{Action}(h', i)$
where $h' \in \mathrm{Hists}(M)$ is the shortest possible history such that (1) $\mathrm{State}(h', i) = \mathrm{Last}(h)$, (2) $\Pr^\alpha(h') > 0$,
and (3) $\mathrm{Last}(h') = q_j$ with $|h'| = j$. Note that a reachable state $\mathrm{Last}(h)$ with a
strictly positive probability $\Pr^\alpha(h) > 0$ has to access a state of $J$ (such as $\mathrm{Last}(h') = q_j$ where $|h'| = j$);
otherwise the MDP $M$ with strategy $\alpha$ would not be strongly synchronizing. Consequently,
the history $h'$ defined above always exists.

As a result, we can define $\mathrm{SizePath}(h) = |h'| - |h|$ to be the size of the shortest path from $\mathrm{Last}(h)$ to the
infinite sequence $J$. Note that for $h$ with $|h| = i$ and $\mathrm{Last}(h) = q_i$, we define $\mathrm{SizePath}(h) = 1$. It is easy to
see that the MDP $M$ with pure strategy $\beta$ is also strongly synchronizing.

In the following, we show that there exists a cycle $C_P$ of $M_P$ which has only one recurrent cyclic set
$G$, and all $g \in G$ are singletons. By construction, we have $\beta(h) = \beta(h')$ for all histories $h, h' \in \mathrm{Hists}(M)$
with $\mathrm{Last}(h) = \mathrm{Last}(h')$ and $|h| = |h'|$. Therefore the pure strategy $\beta$ induces an infinite path $P_\beta$ in the
perfect-information subset construction $M_P$. Since the state space of $M_P$ is finite, some cell $S$ has to
be visited infinitely many times along $P_\beta$. The path between two visits to $S$ along $P_\beta$ is a cycle (not
necessarily a simple cycle) of $M_P$. We study one of these cycles (starting at $S$ and coming back
there), and prove that this cycle satisfies the conditions of the theorem.

Let $\mathrm{Inf}(J)$ denote the set of all states visited infinitely often along $J$. Hence, there exists $N_{\mathrm{Inf}} \geq n_0$
such that $\forall i \geq N_{\mathrm{Inf}} : q_i \in J \Rightarrow q_i \in \mathrm{Inf}(J)$. Let $K_1$ be the first step after $N_{\mathrm{Inf}}$ in which the path $P_\beta$ visits $S$.
Let $\mathrm{MaxPath} = \max_{h \in \mathrm{Hists}(M),\, \Pr^\beta(h) > 0,\, |h| = K_1} \mathrm{SizePath}(h)$ be the length of the longest path (among the
shortest ones) from one reachable state at step $K_1$ to the infinite sequence $J$.

Let $C_P$ be the cycle starting in $S$ at step $K_1$, and coming back to this state in some step
$K_2 \geq K_1 + \mathrm{MaxPath}$. We claim that this cycle $C_P$ has only one recurrent cyclic set $G$, and all subsets $g \in G$
are singletons:

1. $G = \{\{q_i\} \mid q_i \in J \text{ for } K_1 \leq i \leq K_2\}$ is a recurrent cyclic set. We already have proved that there
exists $\sigma \in \Sigma$ such that $\mathrm{Post}_\sigma(q_n) = \{q_{n+1}\}$ ($n \geq n_0$). Note that for state $q_n$, the action $\sigma$ is chosen
by the cycle.

2. $G$ is the only recurrent cyclic set. Each state included in $S$ reaches, in at most $\mathrm{MaxPath}$ steps, one
state of $J$. Hence, the cell $S$, as the first element of $C_P$, cannot have another subset $g'$ constructing
another recurrent cyclic set.

We have proved that for a strongly synchronizing MDP $M$, the perfect-information subset construction
for $M$ has a cycle $C_P$ such that $|A(C_P)| = 1$, and for $G \in A(C_P)$ and for all $g \in G$, $|g| = 1$.

□ 

Through the proof of Theorem 1, we have seen that for all strategies $\alpha$ such that an MDP $M$ with the
strategy $\alpha$ is strongly synchronizing, there is a pure strategy that satisfies the strong synchronization
condition. We will see that this is also the case for the weakly synchronizing objective (see the proof of
Theorem 2).
Corollary 1 For both strongly and weakly synchronizing objectives, pure strategies are sufficient in
MDPs.

Theorem 2 For a perfect-information game over an MDP $M$, there exists a strategy $\alpha$ such that $M$ with
strategy $\alpha$ is weakly synchronizing if and only if the perfect-information subset construction $M_P$ for $M$
has an accessible cycle $C_P$ such that $|A(C_P)| = 1$, and for $G \in A(C_P)$, there exists $g \in G$ such that $|g| = 1$.
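
Once the set $A(C_P)$ of minimal recurrent cyclic sets of a cycle is known, the conditions of Theorems 1 and 2 on that cycle are simple to test. The Python sketch below illustrates the two tests; the function names and the list-of-tuples encoding are assumptions, and the functions check the condition for one given cycle only (the theorems ask for the existence of some accessible cycle satisfying it). The third sample is the set $A(C_P)$ of the Figure 2 example; the first two are hypothetical.

```python
def strong_condition(minimal_sets):
    """Theorem 1: exactly one minimal set, and every g in it is a singleton."""
    return len(minimal_sets) == 1 and all(len(g) == 1 for g in minimal_sets[0])


def weak_condition(minimal_sets):
    """Theorem 2: exactly one minimal set, and some g in it is a singleton."""
    return len(minimal_sets) == 1 and any(len(g) == 1 for g in minimal_sets[0])


# Hypothetical sets of minimal recurrent cyclic sets.
all_singletons = [({1}, {2}, {1})]
one_singleton = [({1}, {2, 3}, {1})]
# A(C_P) from the Figure 2 example: two incomparable minimal sets.
figure2 = [({2}, {3}, {4}), ({5}, {6}, {7})]

for sample in (all_singletons, one_singleton, figure2):
    print(strong_condition(sample), weak_condition(sample))
```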

Proof Sufficient condition. We suppose that the perfect-information subset construction M p for M, has 
an accessible cycle C p such that |A(C P )| = 1, and for G G A(C P ), there exists g G G such that \g\ = 1. 
Consider a pure strategy similarly to which presented in proof of Theorem [T] Let us, here as well, 
construct the Markov chain M', and therefore discuss on transient and recurrent states of M'. 

Suppose that $G \in A(C_P)$ is the only recurrent cyclic set of the cycle, and that it consists of the $d$ elements
$g_0, \dots, g_{d-1}$. Let $R$ be the set of states $(m+i, \ell)$ such that $\ell \in g_i$ for $0 \le i < d$. As shown in the
proof of Theorem 1, the states of $R$ are the only recurrent states in the Markov chain $M'$. Let $p_n$ be
the probability to be in some state of $R$ at step $n$. By Proposition 1, for the transient states $(i,\ell)$
the probability $X_n((i,\ell))$ vanishes as $n \to \infty$, which gives $\lim_{n\to\infty} p_n = 1$. On the other hand, by
hypothesis, for $G \in A(C_P)$ there exists $g_j \in G$ ($0 \le j < d$) such that $|g_j| = 1$. Then every $d$ steps, at
least once, the probability $p_{m+k\cdot d+j}$ gathers in a single state $(m+j,\ell)$ where $\ell \in g_j$. As a result, for
all $k \in \mathbb{N}$, $\max(\|X_{m+d\cdot k}\|, \|X_{m+d\cdot k+1}\|, \dots, \|X_{m+d\cdot k+d-1}\|) \ge p_{m+k\cdot d+j}$. We have shown that $\lim_{n\to\infty} p_n = 1$,
hence $\limsup_{n\to\infty} \|X_n^\alpha\| = 1$.
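
The sufficiency argument can be illustrated numerically. The following Python sketch is not taken from the paper: it builds a small hypothetical Markov chain (the state names `t`, `u`, `a`, `b`, `c` and the helpers `step` and `norms` are assumptions made for illustration) whose only recurrent cyclic set is {{a}, {b, c}} with period d = 2, and it tracks the norm max_s X_n(s). Because one element of the cyclic set is a singleton, the norm approaches 1 at every second step and stays near 1/2 in between, so the limit superior is 1 although the limit does not exist.

```python
from fractions import Fraction


def step(dist, chain):
    """One step of a Markov chain; chain[s] maps each successor to its probability."""
    nxt = {}
    for s, p in dist.items():
        for t, q in chain[s].items():
            nxt[t] = nxt.get(t, 0) + p * q
    return nxt


def norms(chain, init, steps):
    """Return [||X_0||, ..., ||X_steps||] where ||X|| is the largest state probability."""
    dist, out = dict(init), []
    for _ in range(steps + 1):
        out.append(max(dist.values()))
        dist = step(dist, chain)
    return out


half = Fraction(1, 2)
# Hypothetical chain: 't' and 'u' are transient, and the recurrent part
# alternates between the singleton {'a'} and the pair {'b', 'c'}.
chain = {
    "t": {"a": half, "u": half},
    "u": {"t": Fraction(1)},
    "a": {"b": half, "c": half},
    "b": {"a": Fraction(1)},
    "c": {"a": Fraction(1)},
}
ns = norms(chain, {"t": Fraction(1)}, 40)
```

In this toy chain the odd-indexed entries of `ns` increase towards 1, while the even-indexed entries settle near 1/2, which matches the weakly (but not strongly) synchronizing behaviour described above.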

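Necessary condition. Assume that the MDP $M$ with strategy $\alpha$ is weakly synchronizing, meaning that
$\limsup_{n\to\infty} \|X_n^\alpha\| = 1$. Therefore there exists a subsequence $(\|X_{i_n}^\alpha\|)_{n\in\mathbb{N}}$ that converges to 1 (i.e.,
$\lim_{n\to\infty} \|X_{i_n}^\alpha\| = 1$), where $i_0 < i_1 < i_2 < \cdots$ is an increasing sequence of indices. Then, for $\epsilon < 1/2$ there
exists $n_0 \in \mathbb{N}$ such that for all $n > n_0$ there exists a state $\ell$ such that $X_{i_n}^\alpha(\ell) > 1/2$. This state is unique,
since two distinct states cannot both carry more than half of the probability mass. Let $(\ell, i_n)$
refer to this unique state at position $i_n$. Let $\mathit{Inf}$ be the set of all states $\ell$ such that $X_{i_n}^\alpha((\ell, i_n)) > 1/2$ for
infinitely many $n \in \mathbb{N}$.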

Hence, there exists $N_{\mathit{Inf}} > n_0$ such that $\forall n > N_{\mathit{Inf}} : X_{i_n}^\alpha((\ell, i_n)) > 1/2 \Rightarrow \ell \in \mathit{Inf}$.

Since the state space of the MDP is finite, for a specific $q \in \mathit{Inf}$ we can define a subsequence $(j_k)_{k\in\mathbb{N}}$
of $(i_k)_{k\in\mathbb{N}}$ such that

1. $j_0 > N_{\mathit{Inf}}$, and

2. $X_{j_k}^\alpha((q, j_k)) > 1/2$, and

3. $\mathrm{Supp}(X_{j_k}^\alpha) = \mathrm{Supp}(X_{j_{k+1}}^\alpha)$; in the sequel, we denote this set by $S$.

Let $(q, j_k)$ refer to the state $q$ at the specific step $j_k$, and let $J$ be the sequence of these states. Note that since
$(j_k)$ is a subsequence of $(i_k)$, we have $\lim_{k\to\infty} \|X_{j_k}^\alpha\| = 1$ as well.

We use the infinite sequence $J$ to construct a winning pure strategy from the winning randomized
strategy $\alpha$. Consider the pure strategy $\beta$ defined as follows. For $h \in \mathrm{Hists}(M)$ with $|h| = i$, we define $\beta(h) =
\mathrm{Action}(h', i)$ where

$h' \in \mathrm{Hists}(M)$ is the shortest possible history such that (1) $\Pr^\alpha(h') > 0$, (2) $\mathrm{Last}(h') = (q, j_k)$ where
$|h'| = j_k$ for some $k \in \mathbb{N}$, and in addition (3) $\mathrm{State}(h', i) = \mathrm{Last}(h)$. Note that a reachable state
$\mathrm{Last}(h)$ with strictly positive probability $\Pr^\alpha(h) > 0$ has to reach the infinite sequence $J$; otherwise
the MDP $M$ with strategy $\alpha$ would not be weakly synchronizing. Consequently, the history $h'$ defined
above always exists.

Similarly to the strongly synchronizing case, we define $\mathrm{SizePath}(h) = |h'| - |h|$ to be the length
of the shortest path from $\mathrm{Last}(h)$ to the infinite sequence $J$.






In the following, we show that for a weakly synchronizing MDP $M$, there exists a cycle $C$ of $M$
which has only one recurrent cyclic set $G$, and there exists $g \in G$ which is a singleton. By construction,
we have $\beta(h) = \beta(h')$ for all histories $h, h' \in \mathrm{Hists}(M)$ with $\mathrm{Last}(h) = \mathrm{Last}(h')$ and $|h| = |h'|$. Therefore
the pure strategy $\beta$ induces an infinite path $P_\beta$ in the perfect-information subset construction $M_P$. The
construction of $\beta$ also implies that the cell $S$ is visited infinitely many times along $P_\beta$. The path taken
between two visits to $S$ along $P_\beta$ is a cycle (not necessarily a simple cycle) of $M_P$. We study one of
these cycles (starting at $S$ and coming back to it), and prove that this cycle satisfies the conditions of the
theorem.

Let $K_1$ be the first step after $N_{\mathit{Inf}}$ in which the path $P_\beta$ visits $S$. Let us define $\mathrm{MaxPath} =
\max_{h \in \mathrm{Hists}(M),\, \Pr^\alpha(h) > 0,\, |h| = K_1} \mathrm{SizePath}(h)$ to be the length of the longest path (among the shortest ones)
from a reachable state at step $K_1$ to the infinite sequence $J$.

Let $C_P$ be the cycle starting in $S$ at step $K_1$, and coming back to this state at some step $K_2 >
K_1 + \mathrm{MaxPath}$. For convenience, let $d = K_2 - K_1$ denote the length of the cycle $C_P$. We define the
winning pure strategy $\beta'$ from the strategy $\beta$ as follows.

• for $h \in \mathrm{Hists}(M)$ with $|h| < K_1 + K_2$, we define $\beta'(h) = \beta(h)$.

• for $h \in \mathrm{Hists}(M)$ with $|h| \ge K_1 + K_2$, we define $\beta'(h) = \beta(h')$ where $|h| = d \cdot m + |h'|$ for some
$m \in \mathbb{N}$, and $h'$ is a history with $K_1 \le |h'| < K_1 + K_2$ and $\mathrm{Last}(h) = \mathrm{Last}(h')$.
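
The second case folds the length of a long history back into a bounded window by subtracting multiples of the cycle length d. As a minimal sketch (the function name `folded_length` is hypothetical, and taking the smallest admissible m is an assumption, since the definition only asks for some m that lands in the window), the length of the imitated history h' can be computed as follows:

```python
def folded_length(n, k1, k2):
    """Length of the history h' that the folded strategy imitates at length n.

    k1 and k2 play the roles of K_1 and K_2; the cycle length is d = k2 - k1.
    Assumption: the smallest m with n - d*m < k1 + k2 is chosen.
    """
    d = k2 - k1
    if d <= 0:
        raise ValueError("the cycle length K_2 - K_1 must be positive")
    if n < k1 + k2:
        return n  # first case: the strategy is left unchanged
    m = (n - (k1 + k2)) // d + 1
    return n - d * m
```

For instance, with k1 = 3 and k2 = 8 (so d = 5), lengths 11 and 16 are both folded to 6, while length 10 is left unchanged.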

In fact, the path corresponding to the strategy $\beta'$ first reaches the cycle $C_P$, and then follows
this cycle forever. The strategy $\beta'$, like the strategy $\beta$, is weakly synchronizing. We claim that this cycle
$C_P = s_0 \sigma_0 \cdots s_d$ (with $s_d = s_0$) has only one recurrent cyclic set $G$, and that there exists $g \in G$ which is a singleton:

1. First we prove that this cycle has one recurrent cyclic set. The length of the cycle is more than
MaxPath, which shows that some elements of the infinite sequence $J$ are visited along the cycle.
Suppose that $(q, j_N)$ is the last visited state of $J$ along the cycle, and $K' = j_N - K_1$ is the index of the
cell $s_{K'}$ including this state. Let us construct a singleton subset $g_{K'} = \{(q,K')\}$. By induction, let
$g_{(K'+i+1) \bmod d} = \bigcup_{\ell \in g_{(K'+i) \bmod d}} \mathrm{Post}_{\sigma_{(K'+i) \bmod d}}(\ell)$ for all $0 \le i < d$. Note that, for $i = d-1$, the set
$g_{K'}$ is computed again. By definition, the set $G = \{g_0, g_1, \dots, g_{d-1}\}$ is a recurrent cyclic set if, after
this computation, we still have $g_{K'} = \{(q,K')\}$.

We claim that $g_{K'} = \{(q,K')\}$. By contradiction, suppose that $g_{K'} \neq \{(q,K')\}$.

We have $\limsup_{n\to\infty} \|X_n^{\beta'}\| = 1$. Then $\forall \epsilon > 0 \cdot \exists n_0 \in \mathbb{N} \cdot \forall n > n_0 \cdot \exists \ell$ such that $X_n^{\beta'}(\ell) > 1 - \epsilon$. On
the other hand, by definition of $J$, we know that the probability mass in the states of $J$ is more
than 1/2, and in addition we know that all states of the cycle inject probability into $J$; this shows
that the visited states of $J$ along the cycle are the candidates to concentrate the probability mass.
Let $v$ be the smallest probability among all probability distributions of the MDP $M$ (i.e., $v =
\min_{\ell \in L,\, \sigma \in \Sigma,\, \ell' \in \mathrm{Supp}(\delta(\ell,\sigma))} \delta(\ell,\sigma)(\ell')$). Let $\epsilon < \frac{v^d}{1+v^d}$ and $X_n^{\beta'}((q,K')) > 1 - \epsilon$ where $n > n_0$. The
probability that does not flow back to $(q,K')$ (from $(q,K')$ after $d$ steps) is at least $v^d \cdot (1 - \epsilon)$. We
have:

$1 - \epsilon < X_{n+d}^{\beta'}((q,K')) \le 1 - v^d \cdot (1 - \epsilon)$

This gives $\epsilon > \frac{v^d}{1+v^d}$, which is a contradiction. Therefore, we have constructed a recurrent cyclic
set $G$ for the cycle, and have shown that one element of $G$ is a singleton.

2. We can see that $G$ is the only recurrent cyclic set. Each state included in $S$ reaches, in at most
MaxPath steps, one state of $J$. Hence, the cell $S$, as the first element of $C_P$, cannot have another
subset $g'$ that gives rise to another recurrent cyclic set.
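
The inductive computation in step 1 can be sketched in Python. This is an illustration under simplifying assumptions, not the paper's construction: states are plain labels rather than (state, index) pairs, `succ` is a hypothetical table giving the support of each transition, and `cycle_actions[i]` stands for the action labelling the i-th edge of the cycle. The function propagates a singleton once around the cycle with Post and reports whether it closes up on the same singleton, which is the condition checked in step 1.

```python
def post(succ, states, action):
    """Post_action(states): union of the successor supports of `states`."""
    out = set()
    for s in states:
        out |= succ[(s, action)]
    return frozenset(out)


def propagate_singleton(succ, cycle_actions, start_index, start_state):
    """Propagate {start_state} once around a cycle of length d.

    Returns (g, closed): g[i] is the subset obtained at cycle index i, and
    closed is True when the set recomputed at start_index after d steps
    is again the initial singleton.
    """
    d = len(cycle_actions)
    g = [None] * d
    current = frozenset([start_state])
    g[start_index] = current
    for i in range(d):
        index = (start_index + i) % d
        current = post(succ, current, cycle_actions[index])
        if i < d - 1:
            g[(index + 1) % d] = current
    return g, current == g[start_index]


# Hypothetical cycles of length 2 with actions 'x' then 'y'.
succ_closed = {("q", "x"): {"r", "s"}, ("r", "y"): {"q"}, ("s", "y"): {"q"}}
succ_open = {("q", "x"): {"r", "s"}, ("r", "y"): {"q"}, ("s", "y"): {"q", "s"}}
```

With `succ_closed` the singleton {q} comes back to itself after two steps, so the family {q}, {r, s} behaves like a recurrent cyclic set with a singleton element; with `succ_open` some mass can stay in `s`, and the test fails.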
Figure 5: A weakly synchronizing MDP.



We have proved that for a weakly synchronizing MDP $M$, the perfect-information subset construction for
$M$ has a cycle $C_P$ such that $|A(C_P)| = 1$ and, for $G \in A(C_P)$, there exists $g \in G$ such that $|g| = 1$.

□ 

Example The MDP depicted in Figure 5 (with initial distribution $\mu_0(1) = 1$ and $\mu_0(i) = 0$ for $i \in
\{2, \dots, 5\}$), under the strategy $\alpha$ defined by $\alpha((L \times \Sigma)^* L)(\sigma) = 1/2$ for $\sigma \in \Sigma$, is weakly
synchronizing. Note that $L = \{1, \dots, 9\}$ and $\Sigma = \{\sigma_1, \sigma_2\}$.



4 Synchronizing objectives for Blind strategies 

We have defined a blind one-player stochastic game where the player is not allowed to observe the current 
state of the game. We use a characterization of synchronizing blind strategies to show that the existence 
of synchronizing blind strategies can be decided. We first present an example where the player is blind 
and has a strategy to make the game synchronizing. 

Example The MDP depicted in Figure 6 (with initial distribution $\mu_0(1) = 1$ and $\mu_0(i) = 0$ for $i \in
\{2, \dots, 5\}$), under the blind strategy $\alpha$ defined by $\alpha((L \times \Sigma)^* L)(\sigma) = 1/2$ for $\sigma \in \Sigma$, is
strongly synchronizing. Note that $L = \{1, \dots\}$ and $\Sigma = \{\sigma_1, \sigma_2\}$.

Theorem 3 For a blind game over an MDP $M$, there exists a strategy $\alpha$ such that $M$ with strategy $\alpha$ is
strongly synchronizing if and only if the blind subset construction $M_B$ for $M$ has an accessible cycle
$C_B$ such that $|A(C_B)| = 1$ and, for $G \in A(C_B)$ and for all $g \in G$, $|g| = 1$.
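
To make the object in this statement concrete, here is a minimal Python sketch of the reachable part of a blind subset construction. It assumes the usual definition, in which the successor of a cell under an action is the union of the supports of that action from all states of the cell; the paper's exact definition is given elsewhere, so this should be read as an illustration only. The names `blind_subset_construction` and `support`, and the three-state example, are hypothetical.

```python
from collections import deque


def blind_subset_construction(support, actions, initial_cell):
    """Explore the cells reachable from initial_cell.

    support[(state, action)] is the set of possible successor states.
    Returns a dict mapping each reachable cell to {action: successor cell}.
    """
    start = frozenset(initial_cell)
    edges = {}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell in edges:
            continue  # already explored
        edges[cell] = {}
        for a in actions:
            nxt = frozenset(t for s in cell for t in support[(s, a)])
            edges[cell][a] = nxt
            if nxt not in edges:
                queue.append(nxt)
    return edges


# Hypothetical MDP with states 1, 2, 3 and actions 'a', 'b'.
support = {
    (1, "a"): {2, 3}, (1, "b"): {1},
    (2, "a"): {3}, (2, "b"): {2},
    (3, "a"): {3}, (3, "b"): {3},
}
cells = blind_subset_construction(support, ["a", "b"], {1})
```

In this example the cell {3} carries a self-loop, which is an accessible cycle whose only cell is a singleton, so the condition of the theorem can be checked by inspection.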




Figure 6: An MDP that is strongly synchronizing under some blind strategy.

Proof Sufficient condition. Suppose that the blind subset construction $M_B = (\mathcal{L}, L_I, \Sigma, \delta_B)$ for $M$
has an accessible cycle $C_B$ such that $|A(C_B)| = 1$ and, for $G \in A(C_B)$ and for all $g \in G$, $|g| = 1$. Since this
cycle is accessible, there exists a finite path $P = p_0 \sigma'_0 \dots \sigma'_{m-2} p_{m-1} \sigma'_{m-1} p_m$ in $M_B$ from $p_0 = L_I$
to $p_m = s_0$. Consider the pure blind strategy $\alpha$ which, after a history of length $k$, plays the action

$$\begin{cases} \sigma'_k & \text{if } 0 \le k < m, \\ \sigma_{(k-m) \bmod d} & \text{if } m \le k. \end{cases}$$

We construct a Markov chain $M'$ similar to the one presented in the proof of Theorem 1, whose
transition probability function $\delta'$ is defined as follows:

$$\delta'((i,\ell))((i',\ell')) = \begin{cases}
\delta(\ell, \sigma'_i)(\ell') & \text{if } 0 \le i < m,\ i' = i + 1,\ \ell \in p_i \text{ and } \ell' \in p_{i'}, \\
\delta(\ell, \sigma_{i-m})(\ell') & \text{if } m \le i < m + d,\ i' = m + (i - m + 1) \bmod d,\ \ell \in s_{i-m} \text{ and } \ell' \in s_{i'-m}, \\
0 & \text{otherwise.}
\end{cases}$$

Suppose that $G \in A(C_B)$ is the only recurrent cyclic set of the cycle, and that it consists of the $d$ elements
$g_0, \dots, g_{d-1}$. Let $R$ be the set of states $(m+i, \ell)$ such that $\ell \in g_i$ for $0 \le i < d$. As shown in the proof
of Theorem 1, the states of $R$ are the only recurrent states in the Markov chain $M'$.

By Proposition 1, for the transient states $(k,\ell)$ the probability $X_n((k,\ell))$ vanishes as $n \to \infty$.
Since $|g| = 1$ for all $g \in G$, the support of $X_n$ ($n > m$) contains only one recurrent state. Thus,
the probability mass accumulates in that state: for all $\epsilon > 0$ there exists $n_0$ such that for all $n > n_0$ there is a state $(i,\ell)$ with
$X_n((i,\ell)) > 1 - \epsilon$, that is, $\|X_n\| > 1 - \epsilon$. Hence, $\lim_{n\to\infty} \|X_n\| = 1$ and $M'$ is strongly synchronizing.

Therefore, so is the MDP $M$ under the blind strategy $\alpha$.
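
The blind strategy used in this argument only depends on the step number: it first plays the actions along the access path and then repeats the actions of the cycle. The Python sketch below illustrates this on a hypothetical three-state MDP (the transition table `delta` and the helper names are assumptions, not the MDP of Figure 6) and tracks the norm of the state distribution, which tends to 1 because the only recurrent state on the cycle is state 3.

```python
from fractions import Fraction


def blind_action(k, prefix, cycle):
    """Action played at step k: follow the access path, then loop on the cycle."""
    m, d = len(prefix), len(cycle)
    return prefix[k] if k < m else cycle[(k - m) % d]


def norms_under_blind_strategy(delta, init, prefix, cycle, steps):
    """Return [||X_0||, ..., ||X_steps||] under the blind strategy above.

    delta[(state, action)] maps each successor state to its probability.
    """
    dist, out = dict(init), []
    for k in range(steps + 1):
        out.append(max(dist.values()))
        action = blind_action(k, prefix, cycle)
        nxt = {}
        for s, p in dist.items():
            for t, q in delta[(s, action)].items():
                nxt[t] = nxt.get(t, 0) + p * q
        dist = nxt
    return out


half = Fraction(1, 2)
# Hypothetical MDP: 'b' leads from cell {1} to cell {2, 3}; repeating 'a'
# keeps the support {2, 3} while the mass drains from state 2 into state 3.
delta = {
    (1, "b"): {2: half, 3: half},
    (2, "a"): {2: half, 3: half},
    (3, "a"): {3: Fraction(1)},
}
ns = norms_under_blind_strategy(delta, {1: Fraction(1)}, ["b"], ["a"], 20)
```

Here the norm at step n (for n at least 1) is exactly 1 minus 2 to the power -n, so it converges to 1 rather than merely having limit superior 1.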

Necessary condition. The argument follows the proof of Theorem 1; since the winning strategy is
blind here, we use the blind subset construction instead.

□ 

Theorem 4 For a blind game over an MDP $M$, there exists a strategy $\alpha$ such that $M$ with strategy $\alpha$ is
weakly synchronizing if and only if the blind subset construction $M_B$ for $M$ has an accessible cycle $C_B$
such that $|A(C_B)| = 1$ and, for $G \in A(C_B)$, there exists $g \in G$ such that $|g| = 1$.

Proof Sufficient condition. Suppose that the blind subset construction $M_B = (\mathcal{L}, L_I, \Sigma, \delta_B)$ for $M$
has an accessible cycle $C_B$ such that $|A(C_B)| = 1$ and, for $G \in A(C_B)$, there exists $g \in G$ such that $|g| = 1$.

Consider a pure strategy similar to the one presented in the proof of Theorem 1. As in that proof, we
construct the Markov chain $M'$ and reason about the transient and recurrent states of $M'$.

Suppose that $G \in A(C_B)$ is the only recurrent cyclic set of the cycle, and that it consists of the $d$ elements
$g_0, \dots, g_{d-1}$. Let $R$ be the set of states $(m+i, \ell)$ such that $\ell \in g_i$ for $0 \le i < d$. As shown in the
proof of Theorem 1, the states of $R$ are the only recurrent states in the Markov chain $M'$. Let
$p_n$ be the probability to be in some state of $R$ at step $n$. By Proposition 1, for the transient states
$(i,\ell)$ the probability $X_n((i,\ell))$ vanishes as $n \to \infty$, which gives $\lim_{n\to\infty} p_n = 1$. On the other hand, by
hypothesis, for $G \in A(C_B)$ there exists $g_j \in G$ ($0 \le j < d$) such that $|g_j| = 1$. Then every $d$ steps, at
least once, the whole probability $p_{m+k\cdot d+j}$ gathers in a single state $(m+j,\ell)$ where $\ell \in g_j$. As a result, for
all $k \in \mathbb{N}$, $\max(\|X_{m+d\cdot k}\|, \|X_{m+d\cdot k+1}\|, \dots, \|X_{m+d\cdot k+d-1}\|) \ge p_{m+k\cdot d+j}$. We have shown that $\lim_{n\to\infty} p_n = 1$, hence
$\limsup_{n\to\infty} \|X_n^\alpha\| = 1$.

Necessary condition. The argument follows the proof of Theorem 2; since the winning strategy is
blind here, we use the blind subset construction instead.
□

From the four previous theorems, we obtain the following result. 

Theorem 5 The problem of deciding the existence of a {perfect-information, blind} strategy in MDPs 
for a {strongly, weakly} synchronizing objective is decidable. 

We have defined a new class of objectives for Markov decision processes, and we have given a decidable
characterization of winning strategies for these objectives. Further investigations will be devoted
to studying the precise complexity of the problem, establishing memory bounds, and extending this
framework to partially-observable MDPs and stochastic two-player games.